Primary Objective:
To predict future medical expenses.
Secondary Objectives:
To examine the relationship between sex and smoking status.
To identify the factors affecting medical expenses.
To understand the impact of gender, number of children, and region on medical expenses.
Everyone’s life revolves around their health, and good health is essential to all aspects of our lives. Health refers to a person’s ability to cope with their environment on a physical, emotional, mental, and social level. Because of the fast pace of modern life, many of us adopt habits that harm our health. People spend a great deal of money to stay healthy, whether through physical activity or frequent health check-ups, and when we do become ill we often incur substantial medical expenses. An application can therefore be built that helps people understand the factors that are making them unfit and driving up their medical costs, and that identifies and estimates the medical expenses of someone with those factors.
The dataset contains 1338 rows and 7 columns: ‘age’, ‘sex’, ‘bmi’, ‘children’, ‘smoker’, ‘region’, and ‘charges’ (the medical expenses). The charges column is the target; the remaining columns are the independent variables used to predict it.
Age: The first column is age. Age is an important factor for predicting medical expenses because young people are generally healthier than older people, so their medical expenses tend to be considerably lower.
Sex: The next column is sex, which has two categories: male and female. The sex of a subject can also play a role in predicting their medical expenses.
bmi: Next is the bmi column. BMI stands for Body Mass Index, computed as weight in kilograms divided by height in metres squared. For most adults, an ideal BMI is in the 18.5 to 24.9 range; a BMI below 18.5 is considered underweight. (For children and young people aged 2 to 18, the calculation also takes age and gender into account alongside height and weight.) People with a very low or very high BMI are more likely to require medical assistance, resulting in higher costs.
Children: The fourth column is children, which records how many children each subject has. People with children tend to be under more financial pressure, from education and other needs, than people without children.
Smoker: The fifth column is smoker. Smoking is considered one of the most important factors, as people who smoke face elevated health risks, particularly once they reach 50 to 60 years of age.
Region: Next is the region column. Some regions are cleaner and more prosperous than others, and these conditions affect health and, in turn, medical expenses.
Expenses: The last column is ‘charges’, the target column, which records the individual medical costs billed by health insurance; the remaining columns are the independent variables that predict this outcome.
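As a small illustration of the BMI definition and the cut-offs described above, here is a sketch (the helper names `bmi` and `bmi_category` are hypothetical, not part of the dataset code):

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    # Adult cut-offs mentioned in the text (illustrative thresholds)
    if value < 18.5:
        return "underweight"
    elif value < 25:
        return "healthy"
    elif value < 30:
        return "overweight"
    return "obese"

print(bmi(70, 1.75))              # ~22.86, inside the 18.5-24.9 ideal range
print(bmi_category(bmi(70, 1.75)))
```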
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
insurance = pd.read_csv('F:\\Aishwarya\\insurance.csv')
Check out the info(), head(), and describe() methods on insurance.
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
insurance.describe()
| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
insurance.head()
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
# Pie chart of smoker counts: 274 smokers vs. 1064 non-smokers
y = np.array([274, 1064])
mylabels = ["Yes", "No"]
myexplode = [0.2, 0]
plt.pie(y, labels=mylabels, explode=myexplode, shadow=True, autopct='%1.2f%%')
plt.show()
# Pie chart of inhabitants per region
mylabels = ['northeast', 'northwest', 'southeast', 'southwest']
mycolors = ['yellow', 'blue', 'gray', 'hotpink']
sizes = [324, 325, 364, 325]
plt.pie(sizes, labels=mylabels, colors=mycolors, startangle=90, shadow=True, explode=(0.1, 0.1, 0.1, 0.1), autopct='%1.2f%%')
#plt.axis('equal')
plt.show()
# Bar chart of subjects per number of children (0-5)
x = [0, 1, 2, 3, 4, 5]
y = [574, 324, 240, 157, 25, 18]
c = ['hotpink', 'gray', 'black', 'purple', 'orange', 'blue']
plt.bar(x, height=y, color=c)
plt.ylabel('Count')
plt.xlabel('Children')
plt.show()
From the first pie chart we found that 20.48% of the subjects are smokers and 79.52% are non-smokers.
The bar chart shows the subjects grouped by number of children, ranging from 0 to 5; those with no children are by far the most numerous.
A pie chart was again used to plot the number of inhabitants in the region column, which consists of four segments: northeast, northwest, southeast, and southwest. The northwest and southwest counts are equal at 325, while the northeast and southeast counts are 324 and 364 respectively.
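The counts hard-coded into the charts above can be derived directly from the data with pandas `value_counts()`. A sketch on a toy stand-in DataFrame (the real CSV is loaded elsewhere, so these rows are illustrative only):

```python
import pandas as pd

# Toy stand-in for the insurance DataFrame (illustrative values only)
df = pd.DataFrame({'smoker': ['yes', 'no', 'no', 'no', 'yes'],
                   'region': ['southwest', 'southeast', 'southeast',
                              'northwest', 'northeast']})

# Counts per category -- on the real data this yields the 274/1064 smoker split
print(df['smoker'].value_counts())
print(df['region'].value_counts())

# Proportions, matching the pie-chart percentages
print(df['smoker'].value_counts(normalize=True) * 100)
```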
sns.histplot(insurance['age'], kde=True, stat='density')  # distplot is deprecated in recent seaborn
<AxesSubplot:xlabel='age', ylabel='Density'>
sns.histplot(insurance['bmi'], kde=True, stat='density')
<AxesSubplot:xlabel='bmi', ylabel='Density'>
From the age plot we can see that people are spread fairly evenly across the age range.
The BMI of the patients appears approximately normally distributed: most people have a BMI around 30, and very few sit at the low or high extremes. The distribution is slightly right-skewed.
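The skewness claim can be checked numerically with pandas’ `.skew()`. A sketch on synthetic data (not the insurance columns): skewness near zero indicates a roughly symmetric distribution, while a clearly positive value indicates right skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a symmetric sample and a right-skewed (lognormal) sample
symmetric = pd.Series(rng.normal(30, 6, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=9, sigma=0.9, size=10_000))

# Skewness near 0 => roughly symmetric; clearly positive => right-skewed
print(round(symmetric.skew(), 2))
print(round(right_skewed.skew(), 2))
```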
sns.set(style='whitegrid')
f, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.histplot(insurance['charges'], kde=True, stat='density', color='c')
plt.title('Distribution of Charges')
Text(0.5, 1.0, 'Distribution of Charges')
f, ax = plt.subplots(1, 1, figsize=(12, 8))
# Log-transforming the right-skewed charges makes the distribution more symmetric
ax = sns.histplot(np.log10(insurance['charges']), kde=True, stat='density', color='r')
import plotly.express as px
# trendline="ols" overlays the fitted least-squares line (requires statsmodels)
fig = px.scatter(insurance, x="age", y='charges', trendline="ols")
fig.show()
In the diagram above, it can be seen that medical expenses tend to increase with age, although some older people have comparatively low expenses. The trend line shows an approximately linear relationship between expense and age.
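The linear trend can also be fitted directly with `np.polyfit`. A sketch using made-up age/charges pairs rather than the real columns (with the real data, pass the two DataFrame columns instead):

```python
import numpy as np

# Illustrative fit of a straight line: charges ≈ slope * age + intercept
# (synthetic numbers, not the insurance data)
age = np.array([18, 25, 33, 41, 52, 60], dtype=float)
charges = np.array([2000, 3500, 5200, 7100, 9800, 12000], dtype=float)

slope, intercept = np.polyfit(age, charges, deg=1)
print(round(slope, 1), round(intercept, 1))

# Predicted expense for a 45-year-old under this toy fit
print(round(slope * 45 + intercept, 1))
```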
fig = px.scatter(insurance, x="bmi", y='charges', trendline="ols")
fig.show()
In this diagram, most of the data sits at the bottom of the trend line, indicating that people with low, medium, or high BMI can all have low expenses, which is an irregular pattern. The trend line nevertheless suggests that expenses rise with BMI, so we can conclude that a high BMI may increase expenses, though only in comparatively rare cases.
fig = px.box(insurance, y="charges", x='children')
fig.show()
fig = px.box(insurance, y="charges", x='smoker')
fig.show()
a) From the first box plot, it can be seen that charges tend to rise with the number of children, since parents with more children must look after the health of all of them, compared with those who have no children or one child. There are, however, very few subjects with more than three children, so for those groups the charges are roughly the same; subjects with three children show the highest charges.
b) The second box plot shows the relationship between smoking and charges: smokers' charges are much higher than non-smokers'. This is expected, since smoking is injurious to health, making smokers more likely to have health issues and hence higher medical charges.
fig = px.scatter(insurance, x="charges", y='bmi', color='smoker', size='bmi')
fig.show()
It is noticed from the chart that BMI alone is not a powerful factor, since people with low BMI can also have high medical expenses, while people who smoke clearly have high medical expenses. Here the size of each bubble encodes BMI, and among smokers the larger bubbles (higher BMI) tend to appear at higher expense levels.
Linear regression was applied to predict future medical expenses based on features such as age, sex, region, smoking behaviour, and number of children. We computed R² and RMSE values of 0.79 and 5673.09 respectively, meaning 79% of the variation in the target column can be explained by the predictor variables, while the RMSE indicates that the predicted expense can be expected to differ from the actual expense by around 5673 on average.
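As a reminder of how these two metrics are computed, here is a sketch with toy numbers (not the reported 0.79 / 5673.09 values):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Toy actual vs. predicted expenses (illustrative only)
y_true = np.array([3000.0, 8500.0, 12000.0, 21000.0])
y_pred = np.array([3400.0, 8000.0, 12500.0, 19500.0])

# R2: fraction of the variance in y_true explained by the predictions
r2 = r2_score(y_true, y_pred)
# RMSE: square root of the mean squared error, in the same units as charges
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(r2, 3), round(rmse, 2))
```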
insurance
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |
| 1337 | 61 | female | 29.070 | 0 | yes | northwest | 29141.36030 |
1338 rows × 7 columns
insurance['sex'] = insurance['sex'].replace(['female', 'male'], ['1', '2'])
insurance['smoker'] = insurance['smoker'].replace(['yes', 'no'], ['1', '2'])
insurance['region'] = insurance['region'].replace(['southwest', 'southeast', 'northwest', 'northeast'], ['1', '2', '3', '4'])
insurance
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.900 | 0 | 1 | 1 | 16884.92400 |
| 1 | 18 | 2 | 33.770 | 1 | 2 | 2 | 1725.55230 |
| 2 | 28 | 2 | 33.000 | 3 | 2 | 2 | 4449.46200 |
| 3 | 33 | 2 | 22.705 | 0 | 2 | 3 | 21984.47061 |
| 4 | 32 | 2 | 28.880 | 0 | 2 | 3 | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | 2 | 30.970 | 3 | 2 | 3 | 10600.54830 |
| 1334 | 18 | 1 | 31.920 | 0 | 2 | 4 | 2205.98080 |
| 1335 | 18 | 1 | 36.850 | 0 | 2 | 2 | 1629.83350 |
| 1336 | 21 | 1 | 25.800 | 0 | 2 | 1 | 2007.94500 |
| 1337 | 61 | 1 | 29.070 | 0 | 1 | 3 | 29141.36030 |
1338 rows × 7 columns
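The label encoding above assigns ordered numeric codes to nominal categories such as region. One common alternative, shown here as a sketch on a toy DataFrame rather than a change to the pipeline above, is one-hot encoding with `pd.get_dummies`, which avoids imposing an artificial ordering:

```python
import pandas as pd

# Small stand-in for the categorical columns (illustrative rows only)
df = pd.DataFrame({'sex': ['female', 'male', 'male'],
                   'smoker': ['yes', 'no', 'no'],
                   'region': ['southwest', 'southeast', 'northwest']})

# One indicator column per category; drop_first avoids perfect collinearity
encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)
print(encoded.columns.tolist())
```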
X = insurance[['age','sex','bmi','region','smoker','children']]
Y = insurance['charges']
import statsmodels.api as sm
np.asarray(insurance)
array([[19, '1', 27.9, ..., '1', '1', 16884.924],
[18, '2', 33.77, ..., '2', '2', 1725.5523],
[28, '2', 33.0, ..., '2', '2', 4449.462],
...,
[18, '1', 36.85, ..., '2', '2', 1629.8335],
[21, '1', 25.8, ..., '2', '1', 2007.945],
[61, '1', 29.07, ..., '1', '3', 29141.3603]], dtype=object)
# Fitting the OLS regression; note that no constant (intercept) term is added,
# so the R-squared reported below is uncentered
result = sm.OLS(Y, X.astype(float)).fit()
# printing the summary table
print(result.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: charges R-squared (uncentered): 0.835
Model: OLS Adj. R-squared (uncentered): 0.834
Method: Least Squares F-statistic: 1120.
Date: Fri, 26 Aug 2022 Prob (F-statistic): 0.00
Time: 07:39:16 Log-Likelihood: -13802.
No. Observations: 1338 AIC: 2.762e+04
Df Residuals: 1332 BIC: 2.765e+04
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
age 331.9462 13.890 23.898 0.000 304.697 359.195
sex 2961.3196 372.707 7.945 0.000 2230.163 3692.476
bmi 743.7121 26.822 27.728 0.000 691.095 796.329
region 1722.0898 170.978 10.072 0.000 1386.675 2057.505
smoker -1.818e+04 414.670 -43.851 0.000 -1.9e+04 -1.74e+04
children 796.4865 165.574 4.810 0.000 471.673 1121.300
==============================================================================
Omnibus: 132.765 Durbin-Watson: 2.096
Prob(Omnibus): 0.000 Jarque-Bera (JB): 185.950
Skew: 0.764 Prob(JB): 4.18e-41
Kurtosis: 3.999 Cond. No. 111.
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here we found that the two variables sex and region had p-values above 0.05, which would make them statistically insignificant; to confirm this, we ran a variance inflation factor (VIF) test on the independent variables to check for multicollinearity.
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Train-test split, 80:20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=99)
x = insurance.drop(['charges'], axis=1)
y = insurance.charges
Rfr = rfr(n_estimators=100, criterion='mse',
          random_state=1,
          n_jobs=-1)
Rfr.fit(X_train, Y_train)
x_train_pred = Rfr.predict(X_train)
x_test_pred = Rfr.predict(X_test)
# Note: the metrics take (y_true, y_pred); a third positional argument
# would be interpreted as sample_weight
print('MSE train data: %.3f, MSE test data: %.3f' %
      (metrics.mean_squared_error(Y_train, x_train_pred),
       metrics.mean_squared_error(Y_test, x_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' %
      (metrics.r2_score(Y_train, x_train_pred),
       metrics.r2_score(Y_test, x_test_pred)))
MSE train data: 3366647.434, MSE test data: 25386563.235
R2 train data: 0.974, R2 test data: 0.793
plt.figure(figsize=(8, 6))
plt.scatter(x_train_pred, x_train_pred - Y_train,
            c='gray', marker='o', s=35, alpha=0.5,
            label='Train data')
plt.scatter(x_test_pred, x_test_pred - Y_test,
            c='blue', marker='o', s=35, alpha=0.7,
            label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals (predicted - actual)')
plt.legend(loc='upper right')
plt.hlines(y=0, xmin=0, xmax=60000, lw=2, color='red')
<matplotlib.collections.LineCollection at 0x20fa7a420a0>
print('Feature importance ranking\n\n')
importances = Rfr.feature_importances_
std = np.std([tree.feature_importances_ for tree in Rfr.estimators_],axis=0)
indices = np.argsort(importances)[::-1]
variables = ['age', 'sex', 'bmi', 'children','smoker', 'region']
importance_list = []
for f in range(x.shape[1]):
variable = variables[indices[f]]
importance_list.append(variable)
print("%d.%s(%f)" % (f + 1, variable, importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(importance_list, importances[indices],
color="y", yerr=std[indices], align="center")
Feature importance ranking

1.smoker(0.613047)
2.bmi(0.207488)
3.age(0.139971)
4.region(0.020527)
5.children(0.012562)
6.sex(0.006406)
<BarContainer object of 6 artists>
Random Forest is an ensemble learning method for classification and regression. It constructs many decision trees at training time and outputs the average prediction of the individual trees for regression, or the modal class for classification.
It is one of the most powerful machine learning algorithms and works well in most cases.
First, the RandomForestRegressor class was imported from the sklearn.ensemble library so that this model could be used to predict the expenses.
Next, a model was specified using the RandomForestRegressor class.
With the model ready, it was trained on the training data using the fit function.
Here the training data are X_train and Y_train, where X_train holds the independent variables and Y_train the dependent (target) variable.
Once trained, the model was used to make predictions with the predict function, passing in the independent variables; the results were saved in new variables so they could be compared later if required.
After building the predictive model, it was evaluated using several performance metrics.
As with linear regression, we checked the R² and RMSE scores. For the Random Forest model the RMSE score came out to 0.4188 and the R² score to 0.789, which suggests that Random Forest works better than linear regression on this dataset.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train,Y_train)
print("Score the X-train with Y-train is : ", gbr.score(X_train,Y_train))
print("Score the X-test with Y-test is : ", gbr.score(X_test,Y_test))
y_pred = gbr.predict(X_test)
print("MSE: " ,mean_squared_error(np.log(Y_test),np.log(y_pred)))
Score the X-train with Y-train is :  0.9096770710241873
Score the X-test with Y-test is :  0.8285243611660894
MSE:  0.2327093947921688
Gradient Boosting is a very popular boosting technique.
It works by sequentially adding predictors to the ensemble, with each new predictor fitted to the residual errors of those before it, so that earlier mistakes are corrected.
Random Forest is also an ensemble, but in Random Forest the ensembling happens in parallel.
In Gradient Boosting the ensembling happens sequentially: the first model's errors are used to build the second model, the second model's errors to build the third, and so on, until the error is optimised as far as possible.
This means that with gradient boosting models we can drive the error down as far as possible.
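The sequential error-correction idea can be sketched by hand: each shallow tree is fitted to the residuals of the ensemble so far. This is a simplified illustration on synthetic data, not sklearn's actual GradientBoostingRegressor implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic 1-D regression problem (illustrative data only)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 5 + rng.normal(scale=0.3, size=200)

# Start from the mean prediction, then let each shallow tree fit the residuals
pred = np.full_like(y, y.mean())
learning_rate = 0.1
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    pred += learning_rate * tree.predict(X)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(round(mse_start, 3), round(mse_end, 3))  # training error shrinks as trees are added
```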
First, the GradientBoostingRegressor class is imported from the sklearn.ensemble library.
Then a base GradientBoostingRegressor model is created and trained with the fit function on the training data, X_train and Y_train, where X_train holds the independent variables and Y_train the dependent variable.
Once the model is built, the target variable is predicted for the test data using the predict function, and the result is saved in a new variable so the results can be compared later.
Finally, the model is evaluated with the R² and RMSE performance metrics, as for the previous two models; the RMSE score came out to 0.2266 and the R² score to 0.838.
# Compare the three models by R2 score
x = ['Linear Regression', 'Random Forest', 'Gradient Boost']
y = [0.835, 0.789, 0.838]
c = ['hotpink', 'gray', 'blue']
plt.bar(x, height=y, color=c)
plt.ylabel('R2 score')
plt.show()
We collected the R² scores of all three models, Linear Regression, Random Forest, and Gradient Boosting, along with a list of labels, so the values could be compared in a bar chart.
The bar plot shows the highest R² score for Gradient Boosting, which makes the Gradient Boosting model the best choice for this case. We have thus successfully built our predictive models and compared them on the basis of their accuracy and results; the bar plot above shows the performance of the three models.
We found that the most important factors for predicting a subject's medical expenses are smoking behaviour and age. Smoking is bad for health, as is well known, and inevitably increases medical expenses, since a smoker is more likely to fall ill than a non-smoker.
We also found that expenses rise with age: as health becomes more fragile and immunity falls with age, people go for more frequent medical check-ups, are likely to fall ill more quickly, and adopt measures to stay healthy, from medicines to physical activities such as jogging, walking, and yoga, all of which increase medical expenses.
We built three models, among which the Gradient Boosting Regressor shows the best result: about 83% of the variability in expenses can be explained by the predictor variables, with a comparatively low RMSE, so the expense predicted by this model will not differ too much from the actual expense.
In summary, we built a robust machine learning model to predict future medical expenses from a set of features. Such forecasting is valuable for health care planning across various regions, because front-line health delivery services and providers are often not adequately informed or resourced to meet a higher-than-normal demand for health care.